Using word n-grams to identify authors and idiolects
نویسندگان
چکیده
منابع مشابه
Beyond Word N-Grams
We describe, analyze, and experimentally evaluate a new probabilistic model for wordsequence prediction in natural languages, based on prediction suffi~v trees (PSTs). By using efficient data structures, we extend the notion of PST to unbounded vocabularies. We also show how to use a Bayesian approach based on recursive priors over all possible PSTs to efficiently maintain tree mixtures. These ...
متن کاملUsing idiolects and sociolects to improve word prediction
In this paper the word prediction system Soothsayer1 is described. This system predicts what a user is going to write as he is keying it in. The main innovation of Soothsayer is that it not only uses idiolects, the language of one individual person, as its source of knowledge, but also sociolects, the language of the social circle around that person. We use Twitter for data collection and exper...
متن کاملVariable word rate N-grams
The rate of occurrence of words is not uniform but varies from document to document. Despite this observation, parameters for conventional n-gram language models are usually derived using the assumption of a constant word rate. In this paper we investigate the use of variable word rate assumption, modelled by a Poisson distribution or a continuous mixture of Poissons. We present an approach to ...
متن کاملSelecting and Weighting N-Grams to Identify 1100 Languages
This paper presents a language identification algorithm using cosine similarity against a filtered and weighted subset of the most frequent n-grams in training data with optional inter-string score smoothing, and its implementation in an open-source program. When applied to a collection of strings in 1100 languages containing at most 65 characters each, an average classification accuracy of ove...
متن کاملEnhancing News Articles Clustering using Word N-Grams
In this work we explore the possible enhancement of the document clustering results, and in particular clustering of news articles from the web, when using word-based n-grams during the keyword extraction phase. We present and evaluate a weighting approach that combines clustering of news articles derived from the web using n-grams, extracted from the articles at an offline stage. We compared t...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
ژورنال
عنوان ژورنال: International Journal of Corpus Linguistics
سال: 2017
ISSN: 1384-6655,1569-9811
DOI: 10.1075/ijcl.22.2.03wri